Data Cleaning

The raw dataset contains many redundant and uninformative variables. In this project, we first select the useful columns listed below.

BATHRM - Number of Full Bathrooms
HF_BATHRM - Number of Half Bathrooms (no bathtub or shower)
HEAT - Heating
AC - Cooling
NUM_UNITS - Number of Units
ROOMS - Number of Rooms
BEDRM - Number of Bedrooms
AYB - The earliest time the main portion of the building was built
YR_RMDL - Year structure was remodeled
EYB - The year an improvement was built, more recent than the actual year built
STORIES - Number of stories in primary dwelling
SALEDATE - Date of most recent sale
PRICE - Price of most recent sale
QUALIFIED - Qualified
SALE_NUM - Sale Number
GBA - Gross building area in square feet
BLDG_NUM - Building Number on Property
STYLE - Style
STRUCT - Structure
GRADE - Grade
CNDTN - Condition
EXTWALL - Exterior wall
ROOF - Roof type
INTWALL - Interior wall
KITCHENS - Number of kitchens
FIREPLACES - Number of fireplaces
LANDAREA - Land area of property in square feet
WARD - Ward (District is divided into eight wards, each with approximately 75,000 residents)

Second, apart from NA values, this dataset also contains some “No Data” strings (e.g. in the HEAT and GRADE columns, as printed below). After dropping all of these missing values, we obtain a cleaned dataset with 33165 rows and 27 columns.

## [1] "HEAT"
## [1] "GRADE"
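The cleaning steps just described can be sketched in base R; the data frame `raw` and the small column subset here are hypothetical stand-ins, since the report does not show its code:

```r
# Hypothetical sketch: select columns, treat "No Data" strings as
# missing, then drop every row with a missing value.
cols <- c("BATHRM", "HF_BATHRM", "HEAT", "AC", "PRICE", "GRADE")  # subset for illustration
raw <- data.frame(
  BATHRM = c(1, 2, 1), HF_BATHRM = c(0, 1, 0),
  HEAT = c("Forced Air", "No Data", "Warm Cool"),
  AC = c("Y", "Y", "N"), PRICE = c(500000, 650000, NA),
  GRADE = c("Good Quality", "Average", "No Data")
)
df <- raw[, cols]
df[df == "No Data"] <- NA          # convert "No Data" strings into real NAs
clean <- df[complete.cases(df), ]  # keep only fully observed rows
nrow(clean)                        # only the first row survives here
```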

SMART Question: How to predict the condition of a property?

Apart from the price, the condition of a property is always a determining factor when we plan to buy one. In this section, our goal is to predict the condition of a particular property without seeing photos of it. We can hardly obtain the specific features that directly reveal a property's condition; in other words, we cannot establish causality between individual features and condition. Thus, in this report, we simply try to predict whether a property tends to be in better condition using general features that can be easily collected.

Definition of condition

Analogous to ranking problems (e.g. customer reviews on Amazon), the assessment of condition varies across individuals. Hence, there is no uniform rule for classifying properties into different levels of condition. Below is a detailed explanation of condition from the Marshall & Swift Condition Assessment (page E-6).

Excellent Condition - All items that can normally be repaired or refinished have recently been corrected, such as new roofing, paint, furnace overhaul, state of the art components, etc. With no functional inadequacies of any consequence and all major short-lived components in like-new condition, the overall effective age has been substantially reduced upon complete revitalization of the structure regardless of the actual chronological age.

Very Good Condition - All items well maintained, many having been overhauled and repaired as they’ve showed signs of wear, increasing the life expectancy and lowering the effective age with little deterioration or obsolescence evident with a high degree of utility.

Good Condition - No obvious maintenance required but neither is everything new. Appearance and utility are above the standard and the overall effective age will be lower than the typical property.

Average Condition - Some evidence of deferred maintenance and normal obsolescence with age in that a few minor repairs are needed along with some refinishing. But with all major components still functional and contributing toward an extended life expectancy, effective age and utility is standard for like properties of its class and usage.

Fair Condition (Badly worn) - Much repair needed. Many items need refinishing or overhauling, deferred maintenance obvious, inadequate building utility and services all shortening the life expectancy and increasing the effective age.

Poor Condition (Worn Out) - Repair and overhaul needed on painted surfaces, roofing, plumbing, heating; numerous functional inadequacies, substandard utilities, etc. (found only in extraordinary circumstances). Excessive deferred maintenance and abuse, limited value-in-use, approaching abandonment or major reconstruction; reuse or change in occupancy is imminent. Effective age is near the end of the scale regardless of the actual chronological age.

A glance at the condition

The distribution of the number of properties across conditions looks like a tall, narrow normal distribution, which is reasonable. Over 99% of properties are in “Average”, “Good”, or “Very Good” condition. Therefore, predicting the other three levels (“Poor”, “Fair”, and “Excellent”) may cause some problems, which we discuss later.

## 
##      Poor      Fair   Average      Good Very Good Excellent 
##        11        87      9038     19661      4277        91
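The “over 99%” claim can be checked directly from the counts above:

```r
# Counts of properties per condition level, from the table above
counts <- c(Poor = 11, Fair = 87, Average = 9038,
            Good = 19661, `Very Good` = 4277, Excellent = 91)
share <- sum(counts[c("Average", "Good", "Very Good")]) / sum(counts)
round(share, 4)  # 0.9943, i.e. over 99% of the 33165 properties
```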

Binomial prediction [with simplification]

For simplicity, we first try to distinguish whether a property is above or below average condition. In other words, we trivially split the condition into two levels, “<= Average” (including “Poor”, “Fair”, “Average”) and “> Average” (including “Good”, “Very Good”, “Excellent”).

## 
##      Poor      Fair   Average      Good Very Good Excellent 
##        11        87      9038     19661      4277        91
## 
## <= Average  > Average 
##       9136      24029
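The two-level split can be sketched in base R; `cndtn` here is a small hypothetical factor standing in for the report's condition column:

```r
# Order the six levels from worst to best, then split at "Average"
levels6 <- c("Poor", "Fair", "Average", "Good", "Very Good", "Excellent")
cndtn <- factor(c("Average", "Good", "Poor", "Excellent"), levels = levels6)
# Poor/Fair/Average -> "<= Average"; Good/Very Good/Excellent -> "> Average"
cndtn2 <- ifelse(as.integer(cndtn) <= 3, "<= Average", "> Average")
table(cndtn2)
```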

LASSO logistic regression [feature selection]

Since the condition is grouped into 2 levels, we can apply logistic regression to this binomial prediction problem. Unfortunately, however, LASSO does not select a small group of variables when lambda is within 1 standard error of the minimum (more than 48 features remain).


Thus, we choose a model with 8 features that lies within 5 standard errors of the best model. Although selecting a model outside 1 standard error introduces some bias, it performs well on the test data: where the best model's prediction accuracy is 80.88%, this simplified model still achieves 79.89%.

## [1] 0.7988827

As the confusion matrix of the prediction result shows, this model performs better at predicting properties in above-average condition. In the test data, 80.9% of properties predicted to be above average actually are, and 94.5% of the above-average properties are correctly identified.

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  11277 
## 
##  
##                   | observed.classes 
## predicted.classes |         0 |         1 | Row Total | 
## ------------------|-----------|-----------|-----------|
##                 0 |      1297 |       446 |      1743 | 
##                   |     0.744 |     0.256 |     0.155 | 
##                   |     0.416 |     0.055 |           | 
##                   |     0.115 |     0.040 |           | 
## ------------------|-----------|-----------|-----------|
##                 1 |      1822 |      7712 |      9534 | 
##                   |     0.191 |     0.809 |     0.845 | 
##                   |     0.584 |     0.945 |           | 
##                   |     0.162 |     0.684 |           | 
## ------------------|-----------|-----------|-----------|
##      Column Total |      3119 |      8158 |     11277 | 
##                   |     0.277 |     0.723 |           | 
## ------------------|-----------|-----------|-----------|
## 
## 
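The 80.9% and 94.5% figures follow directly from the “1” row and column of this table:

```r
# Counts from the confusion matrix above: rows = predicted, cols = observed
cm <- matrix(c(1297,  446,
               1822, 7712), nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("0", "1"), observed = c("0", "1")))
precision1 <- cm["1", "1"] / sum(cm["1", ])  # 7712 / 9534 = 0.809
recall1    <- cm["1", "1"] / sum(cm[, "1"])  # 7712 / 8158 = 0.945
accuracy   <- sum(diag(cm)) / sum(cm)        # (1297 + 7712) / 11277 = 0.799
round(c(precision1, recall1, accuracy), 3)
```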

Generalized linear model and evaluation

LASSO selects 8 variables (“HEAT”, “AC”, “AYB”, “YR_RMDL”, “EYB”, “PRICE”, “QUALIFIED”, “SALE_NUM”); we then run a second round of feature selection using best-subset GLM, which also removes “HEAT”. In fact, “HEAT” contributes little to the prediction.

##  [1] "BATHRM"     "HF_BATHRM"  "HEAT"       "AC"         "NUM_UNITS" 
##  [6] "ROOMS"      "BEDRM"      "AYB"        "YR_RMDL"    "EYB"       
## [11] "STORIES"    "PRICE"      "QUALIFIED"  "SALE_NUM"   "GBA"       
## [16] "BLDG_NUM"   "STYLE"      "STRUCT"     "GRADE"      "CNDTN"     
## [21] "EXTWALL"    "ROOF"       "INTWALL"    "KITCHENS"   "FIREPLACES"
## [26] "LANDAREA"   "WARD"
## Morgan-Tatar search since family is non-gaussian.
## Note: factors present with more than 2 levels.
##     HEAT            AC            AYB          YR_RMDL       
##  Mode :logical   Mode:logical   Mode:logical   Mode:logical  
##  FALSE:5         TRUE:5         TRUE:5         TRUE:5        
##                                                              
##                                                              
##                                                              
##                                                              
##    EYB            PRICE         QUALIFIED        SALE_NUM      
##  Mode:logical   Mode :logical   Mode :logical   Mode :logical  
##  TRUE:5         FALSE:2         FALSE:1         FALSE:2        
##                 TRUE :3         TRUE :4         TRUE :3        
##                                                                
##                                                                
##                                                                
##    Criterion    
##  Min.   :890.4  
##  1st Qu.:893.2  
##  Median :894.2  
##  Mean   :895.1  
##  3rd Qu.:897.8  
##  Max.   :899.9
## 
## Call:  glm(formula = y ~ ., family = family, data = Xi, weights = weights)
## 
## Coefficients:
## (Intercept)          ACY          AYB      YR_RMDL          EYB  
##      0.6377       0.9609      -0.3714       0.7846       0.7767  
##       PRICE   QUALIFIEDU     SALE_NUM  
##      0.3674      -0.6431       0.2364  
## 
## Degrees of Freedom: 999 Total (i.e. Null);  992 Residual
## Null Deviance:       1180 
## Residual Deviance: 876.4     AIC: 892.4

After the feature selection, we obtain a simple GLM to predict the condition (above or below average). This model performs quite well, with an area under the ROC curve (AUC) greater than 0.8.

## Area under the curve: 0.8234
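AUC itself needs no special package: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, which can be computed from ranks in base R. A toy sketch with synthetic scores (the report's actual scores come from its fitted GLM):

```r
# Rank-based AUC (equivalent to the normalized Wilcoxon statistic)
auc <- function(scores, labels) {   # labels: 0/1
  r  <- rank(scores)
  n1 <- sum(labels == 1); n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
set.seed(1)
y <- rep(c(0, 1), each = 100)
s <- y * 1.5 + rnorm(200)   # positives score higher on average
auc(s, y)                   # well above 0.5 for a discriminating score
```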

As for McFadden’s pseudo R^2, it has different evaluation criteria compared with the ordinary R^2. McFadden states: “while the R2 index is a more familiar concept to planners who are experienced in OLS, it is not as well behaved as the rho-squared measure, for ML estimation. Those unfamiliar with rho-squared should be forewarned that its values tend to be considerably lower than those of the R2 index…For example, values of 0.2 to 0.4 for rho-squared represent EXCELLENT fit.” A GLM with a value between 0.2 and 0.4 therefore explains the data well.

##          llh      llhNull           G2     McFadden         r2ML 
## -438.2115623 -590.0975621  303.7719997    0.2573913    0.2619709 
##         r2CU 
##    0.3781437
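McFadden's value above can be reproduced from the two log-likelihoods the package prints:

```r
llh     <- -438.2115623   # fitted model log-likelihood (from output above)
llhNull <- -590.0975621   # intercept-only (null) log-likelihood
mcfadden <- 1 - llh / llhNull
round(mcfadden, 7)        # 0.2573913, inside the 0.2-0.4 "excellent fit" band
G2 <- 2 * (llh - llhNull) # likelihood-ratio statistic, 303.772
```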

Prediction of 6-level condition [without simplification]

We’ve solved the two-level prediction; now we attack the full 6-level prediction. A classification tree is better suited to predicting a multi-level categorical variable.

(Single) Decision tree

We apply the variables selected in the previous feature-selection step to a decision tree. Here, the tree uses only two variables, “EYB” and “PRICE”: this simple model indicates that a property improved after 1964 and priced over $2.4 million tends to be in better condition. However, this model has two defects. First, the overall prediction accuracy is only 67.3%, and in the confusion matrix it performs poorly on the “Very Good” condition, with only 51.1% accuracy. Second, the model never predicts the three minor levels: “Poor”, “Fair”, and “Excellent”.


## [1] 0.67314
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  11277 
## 
##  
##                | tree.observed 
## tree.predicted |      Poor |      Fair |   Average |      Good | Very Good | Excellent | Row Total | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##        Average |         0 |        15 |      1479 |       569 |        10 |         0 |      2073 | 
##                |     0.000 |     0.007 |     0.713 |     0.274 |     0.005 |     0.000 |     0.184 | 
##                |     0.000 |     0.577 |     0.478 |     0.085 |     0.007 |     0.000 |           | 
##                |     0.000 |     0.001 |     0.131 |     0.050 |     0.001 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           Good |         1 |        11 |      1612 |      6017 |      1360 |        17 |      9018 | 
##                |     0.000 |     0.001 |     0.179 |     0.667 |     0.151 |     0.002 |     0.800 | 
##                |     1.000 |     0.423 |     0.521 |     0.903 |     0.928 |     0.515 |           | 
##                |     0.000 |     0.001 |     0.143 |     0.534 |     0.121 |     0.002 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##      Very Good |         0 |         0 |         1 |        74 |        95 |        16 |       186 | 
##                |     0.000 |     0.000 |     0.005 |     0.398 |     0.511 |     0.086 |     0.016 | 
##                |     0.000 |     0.000 |     0.000 |     0.011 |     0.065 |     0.485 |           | 
##                |     0.000 |     0.000 |     0.000 |     0.007 |     0.008 |     0.001 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##   Column Total |         1 |        26 |      3092 |      6660 |      1465 |        33 |     11277 | 
##                |     0.000 |     0.002 |     0.274 |     0.591 |     0.130 |     0.003 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 
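The 67.3% overall accuracy is simply the diagonal of this table over the total:

```r
# Correctly classified counts, read off the table's diagonal cells
correct <- c(Average = 1479, Good = 6017, `Very Good` = 95)
round(sum(correct) / 11277, 5)  # 0.67314
```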

Random Forest

To improve on the single-tree model, we apply the random forest algorithm. The prediction accuracy improves (69.7% vs. 67.3%), but the model still never predicts the three minor levels.

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  11277 
## 
##  
##                | tree.observed 
## tree.predicted |      Poor |      Fair |   Average |      Good | Very Good | Excellent | Row Total | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##        Average |         1 |        22 |      1612 |       601 |         9 |         0 |      2245 | 
##                |     0.000 |     0.010 |     0.718 |     0.268 |     0.004 |     0.000 |     0.199 | 
##                |     1.000 |     0.846 |     0.521 |     0.090 |     0.006 |     0.000 |           | 
##                |     0.000 |     0.002 |     0.143 |     0.053 |     0.001 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           Good |         0 |         4 |      1479 |      5961 |      1164 |        16 |      8624 | 
##                |     0.000 |     0.000 |     0.171 |     0.691 |     0.135 |     0.002 |     0.765 | 
##                |     0.000 |     0.154 |     0.478 |     0.895 |     0.795 |     0.485 |           | 
##                |     0.000 |     0.000 |     0.131 |     0.529 |     0.103 |     0.001 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##      Very Good |         0 |         0 |         1 |        98 |       292 |        17 |       408 | 
##                |     0.000 |     0.000 |     0.002 |     0.240 |     0.716 |     0.042 |     0.036 | 
##                |     0.000 |     0.000 |     0.000 |     0.015 |     0.199 |     0.515 |           | 
##                |     0.000 |     0.000 |     0.000 |     0.009 |     0.026 |     0.002 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##   Column Total |         1 |        26 |      3092 |      6660 |      1465 |        33 |     11277 | 
##                |     0.000 |     0.002 |     0.274 |     0.591 |     0.130 |     0.003 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 
## [1] 0.6974373

Limitation

Because the splitting criterion of the decision tree is information gain (Shannon entropy), it tends to make decisions that favor the majority. Here, most of the data fall in the Good, Very Good, and Average conditions, so the tree rarely predicts Excellent, Poor, or Fair.
I have tried to solve this “discrimination” problem, but so far in vain. We therefore need a model that treats all levels of a categorical variable equally, even when the training data for each level differ in size.
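One common remedy for this imbalance (not applied in this report) is to give each observation a weight inversely proportional to its class frequency, so every level contributes equally to the split criterion; rpart, for instance, accepts such case weights through its `weights` argument. A base-R sketch of the weight computation, using the class counts from this dataset:

```r
# Rebuild the condition labels with the counts reported earlier
cond <- factor(rep(c("Poor", "Fair", "Average", "Good", "Very Good", "Excellent"),
                   times = c(11, 87, 9038, 19661, 4277, 91)))
freq <- table(cond)
w <- as.numeric(1 / freq[as.character(cond)])  # inverse-frequency case weights
# Each class now contributes the same total weight:
tapply(w, cond, sum)                           # all equal to 1
```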

SMART Question: Do the price and sales volume show any systematic pattern over time?

In this section, we use time series analysis to look for patterns in the historical price data and make a prediction. We select SALEDATE and use the mean price of properties in each period as the variable.

Average price per month

##  [1] "BATHRM"     "HF_BATHRM"  "HEAT"       "AC"         "NUM_UNITS" 
##  [6] "ROOMS"      "BEDRM"      "AYB"        "YR_RMDL"    "EYB"       
## [11] "STORIES"    "SALEDATE"   "PRICE"      "QUALIFIED"  "SALE_NUM"  
## [16] "GBA"        "BLDG_NUM"   "STYLE"      "STRUCT"     "GRADE"     
## [21] "CNDTN"      "EXTWALL"    "ROOF"       "INTWALL"    "KITCHENS"  
## [26] "FIREPLACES" "LANDAREA"   "WARD"

Now we need to make a time series object. Let’s set the frequency to 12 (one observation per month), starting in 1992 and increasing in single increments:

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -30436.3  -9113.5  -2642.8    135.1  12406.2  42514.0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9471  0.9869  0.9997  1.0002  1.0181  1.0803
##             Jan        Feb        Mar        Apr        May        Jun
## 1992  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 1993  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 1994  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 1995  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 1996  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 1997  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 1998  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 1999  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 2000  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 2001  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 2002  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 2003  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 2004  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 2005  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 2006  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 2007  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 2008  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 2009  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 2010  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 2011  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 2012  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 2013  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 2014  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 2015  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 2016  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 2017  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
## 2018  -5138.842  -2642.783 -30436.308  -3799.698  12406.175  30207.216
##             Jul        Aug        Sep        Oct        Nov        Dec
## 1992  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 1993  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 1994  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 1995  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 1996  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 1997  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 1998  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 1999  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 2000  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 2001  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 2002  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 2003  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 2004  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 2005  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 2006  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 2007  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 2008  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 2009  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 2010  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 2011  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 2012  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 2013  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 2014  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 2015  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 2016  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 2017  42514.012   9446.516 -28633.102 -21540.550  -9113.446   6730.810
## 2018  42514.012

##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 1992 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 1993 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 1994 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 1995 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 1996 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 1997 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 1998 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 1999 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 2000 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 2001 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 2002 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 2003 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 2004 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 2005 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 2006 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 2007 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 2008 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 2009 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 2010 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 2011 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 2012 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 2013 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 2014 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 2015 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 2016 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 2017 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
## 2018 0.9996520 0.9944291 0.9516181 0.9895590 1.0138775 1.0480446 1.0803478
##            Aug       Sep       Oct       Nov       Dec
## 1992 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 1993 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 1994 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 1995 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 1996 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 1997 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 1998 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 1999 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 2000 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 2001 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 2002 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 2003 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 2004 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 2005 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 2006 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 2007 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 2008 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 2009 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 2010 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 2011 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 2012 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 2013 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 2014 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 2015 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 2016 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 2017 1.0134904 0.9526006 0.9470513 0.9869390 1.0223905
## 2018

## [1] -30436.31
## [1] 42514.01
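The decomposition above can be reproduced on any monthly series with base R's stats functions; a minimal sketch on synthetic data, since the report's aggregation code is not shown:

```r
set.seed(42)
# Synthetic monthly series starting January 1992: trend + yearly cycle + noise
x <- 100 + 1:36 + 10 * sin(2 * pi * (1:36) / 12) + rnorm(36)
ts_price <- ts(x, start = c(1992, 1), frequency = 12)
dec <- decompose(ts_price)  # additive: trend + seasonal + random
frequency(ts_price)         # 12
length(dec$figure)          # one seasonal estimate per calendar month
```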

We can see that the minimum (most negative) seasonal component occurs in March and the maximum in July. Next, we use the HoltWinters function to smooth the data.

## Holt-Winters exponential smoothing with trend and additive seasonal component.
## 
## Call:
## HoltWinters(x = ts_price)
## 
## Smoothing parameters:
##  alpha: 0.1489657
##  beta : 0.03870238
##  gamma: 0.2672223
## 
## Coefficients:
##             [,1]
## a   866753.64592
## b     1538.57592
## s1   27313.52461
## s2  -16352.23455
## s3  -11329.32159
## s4   -9761.26185
## s5     -54.01588
## s6    8977.73667
## s7   -5795.84472
## s8  -51303.78093
## s9   15816.28432
## s10  51094.08402
## s11  70395.84302
## s12  54358.39314
## [1] 1.414129e+12
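The Holt-Winters fit printed above comes from base R's stats package; a minimal sketch on a synthetic monthly series (the report fits its `ts_price` object the same way):

```r
set.seed(7)
# Synthetic monthly series with trend and yearly seasonality
x <- 50 + 0.5 * (1:48) + 8 * sin(2 * pi * (1:48) / 12) + rnorm(48)
y <- ts(x, start = c(1992, 1), frequency = 12)
fit <- HoltWinters(y)       # level (alpha), trend (beta), additive seasonal (gamma)
fit$SSE                     # sum of squared one-step-ahead errors
predict(fit, n.ahead = 12)  # forecast the next year
```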

Since the SSE is very high and the two lines appear fairly inconsistent, this time series model is not a good fit for the price.


Sales Volume per month

Now let’s change our object of study to the sales volume.

It seems that there’s some seasonality. Let’s use an additive model to decompose the series and quantify the seasonal component.

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -33.87399  -7.95399  -3.06079  -0.02138  11.92059  25.80934
##              Jan         Feb         Mar         Apr         May
## 1992 -20.5832212 -33.8739904  -7.9539904  -3.4706571  11.3010096
## 1993 -20.5832212 -33.8739904  -7.9539904  -3.4706571  11.3010096
## 1994 -20.5832212 -33.8739904  -7.9539904  -3.4706571  11.3010096
## 1995 -20.5832212 -33.8739904  -7.9539904  -3.4706571  11.3010096
## 1996 -20.5832212 -33.8739904  -7.9539904  -3.4706571  11.3010096
## 1997 -20.5832212 -33.8739904  -7.9539904  -3.4706571  11.3010096
## 1998 -20.5832212 -33.8739904  -7.9539904  -3.4706571  11.3010096
## 1999 -20.5832212 -33.8739904  -7.9539904  -3.4706571  11.3010096
## 2000 -20.5832212 -33.8739904  -7.9539904  -3.4706571  11.3010096
## 2001 -20.5832212 -33.8739904  -7.9539904  -3.4706571  11.3010096
## 2002 -20.5832212 -33.8739904  -7.9539904  -3.4706571  11.3010096
## 2003 -20.5832212 -33.8739904  -7.9539904  -3.4706571  11.3010096
## 2004 -20.5832212 -33.8739904  -7.9539904  -3.4706571  11.3010096
## 2005 -20.5832212 -33.8739904  -7.9539904  -3.4706571  11.3010096
## 2006 -20.5832212 -33.8739904  -7.9539904  -3.4706571  11.3010096
## 2007 -20.5832212 -33.8739904  -7.9539904  -3.4706571  11.3010096
## ... (the seasonal component repeats identically for every year, so the
## remaining rows are omitted; 2018 has values only through July)
##              Jun         Jul         Aug         Sep         Oct
## 1992  25.8093429  21.9504327  12.5401763  -3.0607853   0.1779968
##              Nov         Dec
## 1992  -6.6649519   3.8286378

## [1] -33.87399
## [1] 25.80934
## Holt-Winters exponential smoothing with trend and additive seasonal component.
## 
## Call:
## HoltWinters(x = ts_salenum)
## 
## Smoothing parameters:
##  alpha: 0.2554623
##  beta : 0.005922093
##  gamma: 0.4265576
## 
## Coefficients:
##            [,1]
## a   202.2356681
## b     0.4984436
## s1   37.1671414
## s2    8.4132978
## s3   11.7017848
## s4    3.6140869
## s5    8.5262782
## s6  -35.8421207
## s7  -64.5330057
## s8    7.6837678
## s9   15.0345282
## s10  66.2757638
## s11  66.0269946
## s12  -6.0070598
## [1] 132759.8

The sum of squared errors of prediction (SSE) is too large, which indicates that the time series forecast does not fit well. We can plot the original values and the forecast values on one chart, with the original series in black and the predicted values in red.
However, the forecast for recent years seems to fit well, so let's make a separate time series forecast for the most recent six years.
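The comparison plot described above can be sketched as follows. This is a minimal stand-in: `x` is simulated here so the example runs, whereas in the report it is the monthly sale-count series.

```r
# Sketch of the fit/plot step: original series in black, Holt-Winters
# fitted values in red. A simulated seasonal series stands in for the data.
set.seed(1)
x <- ts(200 + 30 * sin(2 * pi * (1:120) / 12) + rnorm(120, sd = 15),
        frequency = 12, start = c(1992, 1))
fit <- HoltWinters(x)
plot(fit, col = "black", col.predicted = "red")
fit$SSE  # sum of squared errors of prediction
```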

From 2013 to 2018

From the plot, there seems to be clear seasonality that is consistent over time, so we repeat the same procedure.
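The seasonality check can be sketched with `decompose()`, which yields one adjusted seasonal value per month. A simulated monthly series stands in for `ts_salenum_2013` here so the example is self-contained.

```r
# Minimal sketch of extracting the seasonal components, assuming
# ts_salenum_2013 is a monthly ts of sale counts (simulated here).
set.seed(2)
ts_salenum_2013 <- ts(200 + 40 * sin(2 * pi * (1:72) / 12) + rnorm(72, sd = 12),
                      frequency = 12, start = c(2013, 1))
dec <- decompose(ts_salenum_2013)  # additive decomposition
seasonal <- dec$figure             # one seasonal value per month
summary(seasonal)
month.abb[which.min(seasonal)]     # month with the largest negative component
month.abb[which.max(seasonal)]     # month with the largest positive component
```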

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -80.59097 -16.21597  -2.48681   0.00275  34.80486  61.34653
##              Jan         Feb         Mar         Apr         May
## 2013 -52.7118056 -80.5909722 -16.2159722  -2.4868056  34.8048611
##              Jun         Jul         Aug         Sep         Oct
## 2013  61.3465278  56.0381944  19.3798611  -8.3201389  -0.1784722
##              Nov         Dec
## 2013 -11.6368056   0.5715278
## (rows for 2014-2018 repeat the 2013 values and are omitted;
##  2018 has values only through July)

## [1] -80.59097
## [1] 61.34653

The largest negative adjusted seasonal component occurs in February and the largest positive one in June. We then use the HoltWinters() function again to smooth the data and make a forecast.

## Holt-Winters exponential smoothing with trend and additive seasonal component.
## 
## Call:
## HoltWinters(x = ts_salenum_2013)
## 
## Smoothing parameters:
##  alpha: 0.2047107
##  beta : 0
##  gamma: 0.6408291
## 
## Coefficients:
##            [,1]
## a   222.0120899
## b     0.7565559
## s1   32.2308793
## s2    4.1331619
## s3    9.0735874
## s4   -0.4566040
## s5   -4.0801846
## s6  -42.9989697
## s7  -82.0075402
## s8   -4.6304142
## s9    2.3630781
## s10  63.9622132
## s11  53.1728087
## s12 -55.5462750

alpha = 0.2 means that recent observations carry relatively little weight. beta = 0 means the slope of the trend component stays constant throughout the series. gamma = 0.64 means the seasonal component is estimated from both recent and historical observations.

## [1] 83028.09

The SSE becomes smaller. As the plot shows, the time series forecast is now more consistent with the original observations.

Let’s make a prediction for the next 12 months. The next peak is predicted to fall in the middle of 2019, while the trough is predicted at the beginning of 2019. There will also be a slump after the peak.
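The 12-month-ahead forecast can be sketched with `predict()` on the fitted Holt-Winters model. As before, a simulated series stands in for the real monthly sale counts.

```r
# Sketch of the forecast step: fit Holt-Winters, then predict 12 months
# ahead and plot the fitted values together with the forecast.
set.seed(1)
x <- ts(200 + 30 * sin(2 * pi * (1:72) / 12) + rnorm(72, sd = 10),
        frequency = 12, start = c(2013, 1))
fit <- HoltWinters(x)
fc <- predict(fit, n.ahead = 12)  # point forecasts for the next year
plot(fit, fc)                     # fitted values plus the forecast
```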

SMART Question: How to predict the price with K-nearest neighbors?

##  [1] "X.1"                "BATHRM"             "HF_BATHRM"         
##  [4] "HEAT"               "AC"                 "NUM_UNITS"         
##  [7] "ROOMS"              "BEDRM"              "AYB"               
## [10] "YR_RMDL"            "EYB"                "STORIES"           
## [13] "SALEDATE"           "PRICE"              "QUALIFIED"         
## [16] "SALE_NUM"           "GBA"                "BLDG_NUM"          
## [19] "STYLE"              "STRUCT"             "GRADE"             
## [22] "CNDTN"              "EXTWALL"            "ROOF"              
## [25] "INTWALL"            "KITCHENS"           "FIREPLACES"        
## [28] "USECODE"            "LANDAREA"           "GIS_LAST_MOD_DTTM" 
## [31] "SOURCE"             "CMPLX_NUM"          "LIVING_GBA"        
## [34] "FULLADDRESS"        "CITY"               "STATE"             
## [37] "ZIPCODE"            "NATIONALGRID"       "LATITUDE"          
## [40] "LONGITUDE"          "ASSESSMENT_NBHD"    "ASSESSMENT_SUBNBHD"
## [43] "CENSUS_TRACT"       "CENSUS_BLOCK"       "WARD"              
## [46] "SQUARE"             "X"                  "Y"                 
## [49] "QUADRANT"
## Features: LATITUDE, LANDAREA, YR_RMDL, GRADE, WARD, ASSESSMENT_NBHD
## kNN_acc: 0.5080623  k value: 6
## kNN_acc: 0.5059077  k value: 7
## kNN_acc: 0.5031971  k value: 8
## kNN_acc: 0.5043091  k value: 9
## kNN_acc: 0.5041701  k value: 10
##    
## k      1    2    3    4
##   1 1918  764  434  301
##   2  769 1449  642  298
##   3  519  899 1678  858
##   4  378  461  811 2209
## [1] 14388
## [1] 0.5041701
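The accuracy loop above can be sketched as below. This is a hedged, self-contained stand-in: the predictors are simulated, whereas in the report they are columns such as LATITUDE and LANDAREA, and the response is the binned price.

```r
# Sketch of the kNN loop: scale the predictors (test set scaled with the
# training set's center/scale), then evaluate several k on a held-out set.
library(class)  # knn()
set.seed(1)
n <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$label <- factor(ifelse(df$x1 + df$x2 + rnorm(n, sd = 0.5) > 0, "high", "low"))
idx <- sample(n, 0.7 * n)
tr <- scale(as.matrix(df[idx, 1:2]))
te <- scale(as.matrix(df[-idx, 1:2]),
            center = attr(tr, "scaled:center"),
            scale  = attr(tr, "scaled:scale"))
for (k in 6:10) {
  pred <- knn(tr, te, cl = df$label[idx], k = k)
  cat("kNN_acc:", mean(pred == df$label[-idx]), " k value:", k, "\n")
}
```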
## 
##  3 : 
## Regression tree:
## rpart(formula = PRICE ~ BATHRM + ROOMS + LANDAREA + LATITUDE + 
##     LONGITUDE + BATHRM + ROOMS + LANDAREA + FIREPLACES + YR_RMDL + 
##     GRADE + CNDTN + ASSESSMENT_NBHD, data = ward, method = "anova")
## 
## Variables actually used in tree construction:
## [1] LATITUDE YR_RMDL 
## 
## Root node error: 3.4086e+17/9955 = 3.424e+13
## 
## n= 9955 
## 
##        CP nsplit rel error  xerror     xstd
## 1 0.29866      0   1.00000 1.00026 0.088120
## 2 0.26259      2   0.40268 0.51575 0.046745
## 3 0.01000      3   0.14009 0.14841 0.027584
## 
##  2 : 
## Regression tree:
## rpart(formula = PRICE ~ BATHRM + ROOMS + LANDAREA + LATITUDE + 
##     LONGITUDE + BATHRM + ROOMS + LANDAREA + FIREPLACES + YR_RMDL + 
##     GRADE + CNDTN + ASSESSMENT_NBHD, data = ward, method = "anova")
## 
## Variables actually used in tree construction:
## [1] BATHRM    CNDTN     GRADE     LANDAREA  LONGITUDE YR_RMDL  
## 
## Root node error: 7.9983e+14/8366 = 9.5605e+10
## 
## n= 8366 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.277166      0   1.00000 1.00020 0.031569
## 2 0.078454      1   0.72283 0.72322 0.026383
## 3 0.023777      2   0.64438 0.64505 0.024473
## 4 0.019313      3   0.62060 0.63149 0.024033
## 5 0.016052      6   0.56266 0.58854 0.018774
## 6 0.015621      7   0.54661 0.57720 0.018583
## 7 0.010000      8   0.53099 0.55108 0.017595
## 
##  7 : 
## Regression tree:
## rpart(formula = PRICE ~ BATHRM + ROOMS + LANDAREA + LATITUDE + 
##     LONGITUDE + BATHRM + ROOMS + LANDAREA + FIREPLACES + YR_RMDL + 
##     GRADE + CNDTN + ASSESSMENT_NBHD, data = ward, method = "anova")
## 
## Variables actually used in tree construction:
## [1] ASSESSMENT_NBHD BATHRM          CNDTN           LANDAREA       
## 
## Root node error: 1.1972e+15/10145 = 1.1801e+11
## 
## n= 10145 
## 
##          CP nsplit rel error  xerror     xstd
## 1  0.196204      0   1.00000 1.00029 0.030107
## 2  0.072358      1   0.80380 0.80436 0.030167
## 3  0.070408      2   0.73144 0.72248 0.027881
## 4  0.034886      3   0.66103 0.66247 0.022223
## 5  0.033104      4   0.62614 0.62925 0.021655
## 6  0.023320      5   0.59304 0.59604 0.020256
## 7  0.013649      6   0.56972 0.57321 0.019252
## 8  0.010940      7   0.55607 0.56925 0.019017
## 9  0.010491      8   0.54513 0.55795 0.018692
## 10 0.010000      9   0.53464 0.55152 0.018536
## 
##  6 : 
## Regression tree:
## rpart(formula = PRICE ~ BATHRM + ROOMS + LANDAREA + LATITUDE + 
##     LONGITUDE + BATHRM + ROOMS + LANDAREA + FIREPLACES + YR_RMDL + 
##     GRADE + CNDTN + ASSESSMENT_NBHD, data = ward, method = "anova")
## 
## Variables actually used in tree construction:
## [1] BATHRM    CNDTN     LONGITUDE YR_RMDL  
## 
## Root node error: 3.7421e+14/6375 = 5.87e+10
## 
## n= 6375 
## 
##          CP nsplit rel error  xerror     xstd
## 1  0.165473      0   1.00000 1.00051 0.045743
## 2  0.066390      1   0.83453 0.84301 0.044579
## 3  0.057505      2   0.76814 0.76914 0.043544
## 4  0.024391      3   0.71063 0.72054 0.042316
## 5  0.023740      4   0.68624 0.69404 0.042093
## 6  0.022582      5   0.66250 0.67905 0.042912
## 7  0.012376      6   0.63992 0.64854 0.042302
## 8  0.012240      7   0.62754 0.64014 0.042290
## 9  0.011013      8   0.61530 0.62506 0.042264
## 10 0.010000      9   0.60429 0.62334 0.042775
## 
##  4 : 
## Regression tree:
## rpart(formula = PRICE ~ BATHRM + ROOMS + LANDAREA + LATITUDE + 
##     LONGITUDE + BATHRM + ROOMS + LANDAREA + FIREPLACES + YR_RMDL + 
##     GRADE + CNDTN + ASSESSMENT_NBHD, data = ward, method = "anova")
## 
## Variables actually used in tree construction:
## [1] LANDAREA LATITUDE ROOMS    YR_RMDL 
## 
## Root node error: 4.3996e+18/8803 = 4.9979e+14
## 
## n= 8803 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.170869      0   1.00000 1.00014 0.061531
## 2 0.156451      2   0.65826 0.69704 0.039667
## 3 0.037539      3   0.50181 0.51568 0.033452
## 4 0.013150      4   0.46427 0.47250 0.028343
## 5 0.010288      7   0.42378 0.45806 0.026004
## 6 0.010000      8   0.41349 0.44841 0.024739
## 
##  5 : 
## Regression tree:
## rpart(formula = PRICE ~ BATHRM + ROOMS + LANDAREA + LATITUDE + 
##     LONGITUDE + BATHRM + ROOMS + LANDAREA + FIREPLACES + YR_RMDL + 
##     GRADE + CNDTN + ASSESSMENT_NBHD, data = ward, method = "anova")
## 
## Variables actually used in tree construction:
## [1] BATHRM    CNDTN     LANDAREA  LONGITUDE ROOMS     YR_RMDL  
## 
## Root node error: 7.5523e+14/6588 = 1.1464e+11
## 
## n= 6588 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.159156      0   1.00000 1.00020 0.044150
## 2 0.068590      1   0.84084 0.84201 0.044991
## 3 0.021643      5   0.56649 0.56738 0.030395
## 4 0.019608      6   0.54484 0.56038 0.030055
## 5 0.013844      7   0.52523 0.54321 0.028563
## 6 0.012245      8   0.51139 0.52007 0.026621
## 7 0.010000      9   0.49915 0.50312 0.025468
## 
##  8 : 
## Regression tree:
## rpart(formula = PRICE ~ BATHRM + ROOMS + LANDAREA + LATITUDE + 
##     LONGITUDE + BATHRM + ROOMS + LANDAREA + FIREPLACES + YR_RMDL + 
##     GRADE + CNDTN + ASSESSMENT_NBHD, data = ward, method = "anova")
## 
## Variables actually used in tree construction:
## [1] LANDAREA LATITUDE ROOMS    YR_RMDL 
## 
## Root node error: 4.4505e+14/4559 = 9.7621e+10
## 
## n= 4559 
## 
##         CP nsplit rel error  xerror    xstd
## 1 0.058806      0   1.00000 1.00045 0.37075
## 2 0.020213      6   0.64717 0.96786 0.37745
## 3 0.017139      7   0.62695 0.94712 0.37730
## 4 0.015395      8   0.60981 0.93594 0.37733
## 5 0.010000      9   0.59442 0.92424 0.37730
## 
##  9 : 
## Regression tree:
## rpart(formula = PRICE ~ BATHRM + ROOMS + LANDAREA + LATITUDE + 
##     LONGITUDE + BATHRM + ROOMS + LANDAREA + FIREPLACES + YR_RMDL + 
##     GRADE + CNDTN + ASSESSMENT_NBHD, data = ward, method = "anova")
## 
## Variables actually used in tree construction:
## [1] CNDTN    LANDAREA LATITUDE ROOMS    YR_RMDL 
## 
## Root node error: 1.799e+14/2760 = 6.5181e+10
## 
## n= 2760 
## 
##         CP nsplit rel error  xerror    xstd
## 1 0.047072      0   1.00000 1.00084 0.20906
## 2 0.020020      3   0.85879 0.90771 0.20863
## 3 0.015218      8   0.75468 0.80213 0.20142
## 4 0.013995      9   0.73946 0.79516 0.20131
## 5 0.013834     14   0.66730 0.79304 0.20131
## 6 0.011462     15   0.65347 0.78981 0.20112
## 7 0.010000     16   0.64200 0.74984 0.20154
## 
## Features (all wards): LANDAREA, YR_RMDL, LATITUDE, ROOMS, CNDTN, LONGITUDE, GRADE, BATHRM, ASSESSMENT_NBHD
## Ward 3: kNN_acc 0.6773805 (k=6), 0.6830052 (k=7), 0.6870229 (k=8), 0.6862194 (k=9), 0.6826035 (k=10)
## Ward 2: kNN_acc 0.5702677 (k=6), 0.5869981 (k=7), 0.5826960 (k=8), 0.5898662 (k=9), 0.5927342 (k=10)
## Ward 7: kNN_acc 0.5731179 (k=6), 0.5813953 (k=7), 0.5829720 (k=8), 0.5877020 (k=9), 0.5857312 (k=10)
## Ward 6: kNN_acc 0.5577164 (k=6), 0.5526976 (k=7), 0.5552070 (k=8), 0.5495609 (k=9), 0.5451694 (k=10)
## Ward 4: kNN_acc 0.6942299 (k=6), 0.7005906 (k=7), 0.6933212 (k=8), 0.6974103 (k=9), 0.6969559 (k=10)
## Ward 5: kNN_acc 0.5689132 (k=6), 0.5731633 (k=7), 0.5743777 (k=8), 0.5707347 (k=9), 0.5798421 (k=10)
## Ward 8: kNN_acc 0.7271930 (k=6), 0.7412281 (k=7), 0.7289474 (k=8), 0.7324561 (k=9), 0.7298246 (k=10)
## Ward 9: kNN_acc 0.7550725 (k=6), 0.7681159 (k=7), 0.7536232 (k=8), 0.7536232 (k=9), 0.7449275 (k=10)
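The per-ward regression trees printed above can be sketched with `rpart()`. This is a minimal stand-in: a toy data frame replaces the real per-ward subset, and only two of the report's predictors are simulated.

```r
# Sketch of one per-ward regression tree, assuming a data frame `ward`
# with PRICE and predictors like BATHRM and LANDAREA (simulated here).
library(rpart)
set.seed(1)
ward <- data.frame(BATHRM   = sample(1:4, 300, replace = TRUE),
                   LANDAREA = rlnorm(300, meanlog = 8, sdlog = 0.4))
ward$PRICE <- 1e5 * ward$BATHRM + 50 * ward$LANDAREA + rnorm(300, sd = 5e4)
fit <- rpart(PRICE ~ BATHRM + LANDAREA, data = ward, method = "anova")
printcp(fit)  # the complexity-parameter (CP) table, as printed per ward
```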